Multiprocessor cache system.

The bandwidth of the data transfer among a
main memory and snoopy caches is improved by
solving the bus neck in a multiprocessor system
using a snoopy cache technique. Shared bus
coupling is employed for an address/command bus 5
requiring bus snoop, whereas multiple data paths
coupled by an interconnection network 7 are used
for the data bus not requiring bus snoop. The
multiple data paths 7 reflect the order of the snoopy
operations in the order of data transfer so as to
maintain data consistency among the caches.
This invention relates to multiprocessor systems
and, more particularly, to multiprocessor systems
having a plurality of processors provided with
respective private caches and having a shared
memory space.

Conflicts in access to a shared memory are the
most serious bottleneck preventing improvement
of system performance in a multiprocessor
system of the shared memory type. To ease this
bottleneck, private caches are often added to the
respective processors, thereby decreasing the
bandwidth required of the shared memory. A
technique for maintaining the consistency of data
among these caches, the "snoopy cache"
technique, is also well known. In this technique, each
cache always monitors the memory accesses that
occur on the shared bus (the "shared bus" herein
means a communication medium to which a
plurality of resources are connected and which is
concurrently shared by these resources), and performs
appropriate operations, if necessary, on the
corresponding cache block to maintain the
consistency of data with respect to the other caches
and the main memory. Such consistency operations
are implemented in hardware. This technique is
excellent because data consistency is maintained
easily and at high speed, and it is accordingly
widely adopted. However, the "snoopy cache"
technique cannot resolve one major problem, the
bus bottleneck ("bus neck"), because it is
based on a shared bus architecture, and it is
accordingly practical only for small-scale parallel
systems including, at most, ten or so processors.

On the other hand, as a technique for solving
the bus neck problem, the interconnection network
(the "interconnection network" herein means a
communication medium to which a plurality of
resources are connected and which connects them
one to one, or one to several, by means of a
switch) has been studied for a long time. In a
multiprocessor system coupled by an interconnection
network, the number of coupling links
increases with the number of processors constituting
the system. Therefore, the interconnection network
technology ensures a transfer bandwidth which is
proportional to the number of processors, and
makes it possible to realize a large-scale parallel
system including hundreds of processors. However,
it is impossible for each private cache added
to each processor to monitor all memory accesses by
other processors. Therefore, it is theoretically
impossible for such a system to perform control of
data consistency by hardware implementing the
"snoopy cache" technique. Under these circumstances,
it is usual to give up consistency control
by hardware and rely on software to perform
consistency control. In this approach, caches are
controlled by software so that copies of the same
memory address are never possessed concurrently
by a plurality of caches. More specifically, under
control of a software protocol, corresponding copies
in caches are invalidated by software instructions at
an appropriate time to ensure that only one cache
possesses the copy at any point in time.
Drawbacks of this technique are the increased
load imposed on the software and the decrease in
performance caused by static invalidation by
software instead of dynamic optimization of the use of
caches by hardware.

A technique related to the present invention has
been proposed that combines a
snoopy bus and an interconnection network
(Bhuyan, L. N.; Bao Chyn Liu; Ahmad, I. "Analysis
of MIN based multiprocessors with private cache
memories." Proceedings of the 1989 International
Conference on Parallel Processing, 8th to 12th
August, 1989, pp. 51-58). In this technique, a
snoopy bus is provided in addition to an
interconnection network. Memory access that requires
communication among caches for control of data
consistency is processed through the snoopy bus,
and normal memory access that does not require
communication among caches is processed
through the interconnection network. In order to
decide whether communication among the
caches is required, a table storing the states of all
shared copies in the system is added to each
cache. In this technique, the upper limit of the
transfer bandwidth is determined by either that of
the shared bus used for access to shared data or
that of the interconnection network used for access
to particular data, whichever saturates first.
Therefore, the upper limit of the
transfer bandwidth in this technique largely
depends on the characteristics of the program to be
executed. It is reasonable to consider that, in a
multiprocessor system using a snoopy cache
technique well designed so as to significantly decrease
the cache miss ratio, a large fraction of all the
access requests occurring on the system bus
would be access requests generated by
communication among caches for control of data
consistency. Therefore, this technique merely realizes a
transfer bandwidth several times wider than the
bandwidth realized by the shared bus coupling
technique. This technique also requires that each
cache have a management table that
describes the state of the entire system in order to
make it possible to determine locally whether
access using the shared bus is required or only
access using the interconnection network is
required. In addition, the control mechanism of this
technique becomes complicated because it must
control both the shared bus and the interconnection
network by using the table.

This invention provides a multiprocessing sys- 
tem comprising a plurality of processors; a main 
memory divided into a plurality of modules; a plu- 
rality of cache memories provided for the proces- 
sors; shared bus means coupled to the cache 
memories for transferring data address information 
to the cache memories; control means provided for 
said cache memories for monitoring said address 
information transferred through the shared bus 
means to perform data consistency procedures; 
and interconnection network means for selectively 
interconnecting the cache memories and the mem- 
ory modules, on the basis of the address informa- 
tion, for data transfer therebetween. 

This invention has been made in view of the
aforementioned circumstances, and enables the
bus neck of the "snoopy cache" technique based
on shared bus coupling to be removed by
using simple hardware (a control mechanism), without
the software on which multiprocessor systems
coupled through an interconnection network would
have relied for data consistency maintenance.

According to the invention, in a tightly coupled
multiprocessor system having a plurality of
processors provided with respective private caches and
having a shared memory space, and employing the
snoopy cache technique for maintaining data
consistency among the caches, the interconnection
network structure can be introduced without any
adverse effect on the snoopy cache technique,
and a significant increase in the transfer bandwidth of
the memory bus can be achieved.

Embodiments of the invention will now be de- 
scribed, by way of example only, with reference to 
the accompanying drawing, wherein:- 

Fig. 1 is a block diagram showing a general 
arrangement of a multiprocessor of a shared- 
bus, shared memory and snoopy cache type; 
Fig. 2 is a block diagram showing an embodi- 
ment of the invention; 

Fig. 3 is a block diagram showing an example of 
data path switch used in the embodiment; 
Fig. 4 is a timing chart of memory access and
bus snoop of the embodiment; 
Fig. 5 is a timing chart of memory access and 
bus snoop in a multiprocessor using a conven- 
tional snoopy cache technique; and 
Fig. 6 is a block diagram of the data path switch 
for explaining the operation of an alternative 
example. 

Fig. 1 shows a multiprocessor system using
the "snoopy cache" technique. In Fig. 1, a plurality
of processors 1a to 1n are connected to a shared
bus 3 and a shared memory 4 via their respective
private caches 2a to 2n. Each of the private caches
2a to 2n monitors the memory accesses occurring on the
shared bus 3, and maintains data consistency
among the caches by executing an appropriate
operation, if necessary, on the corresponding cache
block. That is, the snoopy cache technique
requires that all of the caches monitor the
address/commands on the shared bus and that the
order of the snoops be reflected in the order of data
transfer sufficiently to maintain data consistency
among the caches (this is achieved without any
additional means because data is transferred through
the bus in the order of the snoops). Accordingly,
monitoring the data bus itself is not necessary. On
the other hand, recent high-speed microprocessors
often use a cache line size of
64 bytes or more, and such a long cache line is
block-transferred on a system bus of limited bit
width over a plurality of bus cycles (for
example, 8 bytes x 8 cycles). That is, the
address/command cycle required for the bus snoop
is quite short, only 1 or 2 bus cycles;
nevertheless, the system bus is occupied for a
significantly long time for transferring the long
cache line. The technique according to the
invention takes advantage of these two
facts, and uses shared bus coupling for the
address/command bus requiring the bus snoop, but
uses multiple data paths coupled by the
interconnection network for the data bus not requiring the
bus snoop. The multiple data paths, however, must
reflect the order of the snoops in the order of data
transfer sufficiently to maintain data consistency
among the caches. According to this approach,
while the snoopy cache technique is logically
applied in its entirety, the interconnection network can
be utilized for increasing the transfer bandwidth.

An embodiment of the invention is explained
below with reference to Figs. 2 and 3.

In Fig. 2, snoopy caches 2a to 2n provided for
respective processors 1a to 1n are coupled
together by a single address/command bus 5.
Snoopy operation for cache consistency control is
performed through the address/command bus 5.
On the other hand, data paths 6a to 6n from the
respective snoopy caches 2a to 2n are coupled to
a shared memory system, which consists of a
plurality of interleaved memory modules 9a to 9m, via
a data path switch 7 and data paths 8a to 8m.

Fig. 3 shows an example of the data path
switch 7 in which m and n of Fig. 2 are both eight,
that is, the processors 1a to 1n and the memory
modules 9a to 9m each number eight.
Multiplexers 10a to 10h select data paths 8a to 8h
of the memory modules 9a to 9h, and connect
them to the data paths 6a to 6h of the snoopy
caches 2a to 2h. Multiplexers 11a to 11h select the
data paths 6a to 6h of the snoopy caches 2a to 2h,
and connect them to the data paths 8a to 8h of the
memory modules 9a to 9h. A data path controller
12 controls the multiplexers 10a to 10h and 11a to
11h on the basis of the address and command on the
address/command bus 5, and establishes the data
path necessary for data transfer.

The following explanation is directed to the method
of data transfer between a cache and a memory and
among caches, with reference to Figs. 2 and 3.
Assume here that the cache line is 8 times wider
than the data bus width and that the cache lines are
interleaved across the memory modules 9a to 9h in
such a way that cache lines are stored in ascending
order of address in the memory modules
9a, 9b, ..., 9h, 9a, ... For example, the nth, (n+1)th,
..., and (n+7)th cache lines are stored in sequence
in the memory modules 9a, 9b, ..., and 9h.
In this case, respective parts of the address are
used as follows:

A) Several least significant bits will designate 
respective bytes of data having the data bus 
width. They are ordinarily transferred in a de- 
coded form as byte enable. 

B) Three subsequent less significant bits will
designate the location of the data having the
data bus width in the cache line.

C) Three subsequent less significant bits will 
designate a memory module in which the cache 
line is located. 

D) The remaining more significant bits will des- 
ignate the location of the cache line in the 
memory module. 

Therefore, predetermined several most signifi- 
cant bits are used for changeover of the multiplex- 
ers 10a to 10h. 
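
The address fields A) to D) above can be illustrated with a small sketch. It assumes an 8-byte data bus (the width used in the bandwidth example of this description), so that field A is 3 bits wide; the function name `decode` and the tuple type `Fields` are hypothetical names chosen for illustration only and do not appear in the embodiment.

```python
from typing import NamedTuple

BUS_WIDTH_BYTES = 8   # assumed data bus width (field A covers 3 bits)
WORDS_PER_LINE = 8    # cache line is 8 times the data bus width (field B)
NUM_MODULES = 8       # memory modules 9a to 9h (field C)


class Fields(NamedTuple):
    byte_in_word: int    # field A: byte within bus-width data
    word_in_line: int    # field B: position of bus-width data in the cache line
    module: int          # field C: 0 means module 9a, ..., 7 means module 9h
    line_in_module: int  # field D: location of the cache line in the module


def decode(addr: int) -> Fields:
    """Split a byte address into fields A to D (hypothetical helper)."""
    byte_in_word = addr % BUS_WIDTH_BYTES
    addr //= BUS_WIDTH_BYTES
    word_in_line = addr % WORDS_PER_LINE
    addr //= WORDS_PER_LINE
    module = addr % NUM_MODULES
    line_in_module = addr // NUM_MODULES
    return Fields(byte_in_word, word_in_line, module, line_in_module)
```

Under these assumptions, consecutive 64-byte cache lines fall into consecutive modules, and the ninth line wraps around to module 9a at the next location.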

1) When data are read from the memory into a 
cache: 

Consider the case where the snoopy cache
2d reads data from the memory module 9d.
Using the address/command issued on the
address/command bus 5 by the snoopy cache 2d,
the data path controller 12 controls the
multiplexer 10d and connects the data path 8d to
the data path 6d. Through this data path (shown
by a dotted line in the left half portion of Fig. 3),
data in the memory module 9d are read into the
snoopy cache 2d in 8 bus cycles.

2) When data are written from a cache into a 
memory: 

Consider the case where the snoopy cache
2d writes data into the memory module 9d.
Using the address/command issued on the
address/command bus 5 by the snoopy cache 2d, the data
path controller 12 controls the multiplexer 11d
and connects the data path 6d to the data path
8d. Through this data path (shown by a dotted
line in the right half portion of Fig. 3), data from the
snoopy cache 2d are written into the memory
module 9d in 8 bus cycles.

3) Transfer of data among caches:

Data transfer from one cache to another is
effected by writing data from a cache into an
associated memory module and reading
them out again.

Fig. 4 is a timing chart which shows how
memory access and bus snoop are multiplexed.
The abscissa indicates the bus cycle, taking bus
cycles 1 to 10 as an example. This example shows
that an access to a certain memory address occurs
in bus cycle 1, and the snoopy operation itself
in all the caches finishes within bus cycle 1
alone, but a long cache line is block-transferred
over the 8 bus cycles 2 to 9. If an access to a
different memory module occurs in the subsequent
bus cycle 2, processing of it is started
immediately. The snoopy operation finishes within
bus cycle 2 alone, and the cache line is transferred
from the memory system to the cache that
requests it over the 8 bus cycles 3 to 10. The same
pattern repeats thereafter. Under practical
operating conditions, the effective bandwidth is
determined by both the contention on the
address/command bus and the contention on the
memory modules; however, under the
ideal operating conditions shown in Fig. 4, the
theoretical maximum value is determined by the
snoopy cycle and the cache line size, as is easily
understood from the expression in the timing
chart. For example, assuming that the snoopy
cycle is 40 ns (25 MHz), the data bus width is 8
bytes and the cache line size is 64 bytes, the
upper limit of the realizable bus bandwidth is 1.6G
bytes/second. Note that the timing chart for the
case using the conventional "snoopy cache"
technique is as shown in Fig. 5, in which the upper limit
of the realizable bus bandwidth under the same
conditions is 200M bytes/second.
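
The two upper limits quoted above follow from simple arithmetic, sketched below under the ideal conditions of Fig. 4 (one snoop, and hence one cache line transfer, started per snoopy cycle) and Fig. 5 (one bus-width transfer per bus cycle). The function names are hypothetical and are used only for this illustration; integer arithmetic is used so that the results are exact.

```python
NS_PER_SECOND = 10**9


def max_bandwidth_split(snoop_cycle_ns: int, line_bytes: int) -> int:
    """Upper limit in bytes/second when a new cache line transfer can be
    started every snoopy cycle (multiple data paths, Fig. 4)."""
    return line_bytes * NS_PER_SECOND // snoop_cycle_ns


def max_bandwidth_shared(bus_cycle_ns: int, bus_width_bytes: int) -> int:
    """Upper limit in bytes/second when every transfer occupies the single
    shared data bus (conventional snoopy cache, Fig. 5)."""
    return bus_width_bytes * NS_PER_SECOND // bus_cycle_ns


def transfer_cycles(snoop_cycle: int, cycles_per_line: int = 8) -> list[int]:
    """Bus cycles used for the block transfer that follows a snoop."""
    return list(range(snoop_cycle + 1, snoop_cycle + 1 + cycles_per_line))


split = max_bandwidth_split(40, 64)    # 1.6G bytes/second
shared = max_bandwidth_shared(40, 8)   # 200M bytes/second
```

With these figures the ratio of the two limits equals the number of bus cycles needed to transfer one cache line, here 8.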

Three alternative examples are described
below. The first uses an interconnection network
other than multiplexers as the data path
switch. For example, a crossbar, an omega network or
the like may be used. However, as described
before, the multiple data paths must be such that the
order of the snoops is reflected in the order of data
transfer sufficiently to maintain data consistency
among the caches.

The second is a technique which increases the
speed of data transfer from one cache to another.
This is explained with reference to Figs. 2 and 3.
Consider the case where data corresponding to the
memory module 9d are transferred from the
snoopy cache 2a to the snoopy cache 2d. First, the
data path controller 12 controls the multiplexer 11d
and connects the data path 6a to the data path 8d.
At the same time, the data path controller 12
controls the multiplexer 10d and connects the data
path 8d to the data path 6d (shown by a dotted line
in the left half portion of Fig. 3). The data path
6a is thereby connected to the data path 6d, and data
corresponding to the memory module 9d can be
transferred from the snoopy cache 2a to the
snoopy cache 2d. This modification makes it
possible to transfer data at twice the speed of
the aforementioned method, that is, transferring
data from one cache to another by writing
data from the one cache into a memory module and
afterward reading them from the memory module into
the other cache.
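
The role of the multiplexers 10a to 10h and 11a to 11h in this direct cache-to-cache transfer can be sketched with a toy model. It is only an illustration under simplifying assumptions: caches and modules are numbered 0 to 7 (0 for 2a/9a, 3 for 2d/9d), timing is ignored, and the class and method names are hypothetical rather than taken from the embodiment.

```python
from typing import Optional, Tuple


class DataPathSwitch:
    """Toy model of data path switch 7 for 8 caches and 8 memory modules."""

    def __init__(self, size: int = 8) -> None:
        # Multiplexer 10x: which memory data path 8y feeds cache data path 6x.
        self.to_cache: list[Optional[int]] = [None] * size
        # Multiplexer 11y: which cache data path 6x feeds memory data path 8y.
        self.to_memory: list[Optional[int]] = [None] * size

    def connect_read(self, cache: int, module: int) -> None:
        self.to_cache[cache] = module

    def connect_write(self, cache: int, module: int) -> None:
        self.to_memory[module] = cache

    def connect_cache_to_cache(self, src: int, dst: int, module: int) -> None:
        # Both paths are established at the same time, so data on path 6(src)
        # reaches path 6(dst) by way of path 8(module).
        self.connect_write(src, module)
        self.connect_read(dst, module)

    def source_of(self, cache: int) -> Optional[Tuple[str, int]]:
        """Report what currently drives the data path of the given cache."""
        module = self.to_cache[cache]
        if module is None:
            return None
        writer = self.to_memory[module]
        return ("cache", writer) if writer is not None else ("memory", module)


switch = DataPathSwitch()
switch.connect_cache_to_cache(src=0, dst=3, module=3)  # 2a -> 2d via 8d
```

In this model, `switch.source_of(3)` reports cache 0 as the driver of data path 6d, which corresponds to the connection of data path 6a to data path 6d described above.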

The third is a technique that changes the
correspondence of the cache lines with the memory
modules. Here again, let the cache
line be 8 times wider than the data bus. However,
assume that the 8 data pieces of bus width that
constitute a single cache line are interleaved
across the memory modules 9a to 9h in such a way
that the data pieces are stored in ascending
order of address in the memory modules
9a, 9b, ..., and 9h. For example, the bus-width data
pieces of a single cache line (D1, D2, ...)
are stored in sequence in the memory modules 9a,
9b, ..., and 9h. The address in this case is used as
follows:

A) Several least significant bits will designate 
respective bytes of a data bus width data piece. 
They are ordinarily transferred in a decoded 
form as byte enable. 

B) Three subsequent less significant bits will 
designate the memory module in which the data 
bus width data piece is located. At the same 
time, they will designate the location of the data 
bus width data piece in the cache line. 

C) The remaining more significant bits will des- 
ignate the location of the cache line in the 
memory module. 
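
For this alternative interleaving, the address split A) to C) can be sketched as follows. As before, an 8-byte data bus is assumed, and the function name `decode_striped` is a hypothetical name used only for illustration.

```python
BUS_WIDTH_BYTES = 8  # assumed data bus width (field A covers 3 bits)
NUM_MODULES = 8      # memory modules 9a to 9h


def decode_striped(addr: int) -> tuple[int, int, int]:
    """Return (byte_in_piece, module, line_in_module) for a byte address.

    The module index (0 for 9a, ..., 7 for 9h) is at the same time the
    position of the bus-width data piece within its cache line.
    """
    byte_in_piece = addr % BUS_WIDTH_BYTES
    addr //= BUS_WIDTH_BYTES
    module = addr % NUM_MODULES
    line_in_module = addr // NUM_MODULES
    return byte_in_piece, module, line_in_module
```

Under these assumptions, the eight pieces of one 64-byte cache line land in eight different modules at the same location, which is why a line transfer touches every module once.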

The method of data transfer from the memory to a
cache, from a cache to the memory and from one
cache to another under this arrangement is explained
with reference to Figs. 2 and 6. Fig. 6 is
the same as Fig. 3 except that the data paths shown
by dotted lines are different.

1) When data are read from memory into cache:
Data are always read from all memory
modules. Consider the case where the snoopy cache
2d reads data. In response to the address/command
issued on the address/command bus 5, the data path
controller 12 controls the multiplexer 10d and
first connects the data path 8a to the data path 6d.
Through this data path (the leftmost one of the
paths shown by dotted lines in the left half
portion of Fig. 6), data from the memory module
9a are read into the snoopy cache 2d. In the
next bus cycle, the data path controller 12
controls the multiplexer 10d and connects the data
path 8b to the data path 6d. Through this data
path (the second one from the left end of the paths
shown by dotted lines in the left half portion of
Fig. 6), data from the memory module 9b are
read into the snoopy cache 2d. Similarly, data
from the memory modules 9c to 9h are read
into the snoopy cache 2d.
2) When data are written from cache into memory:

Data are always written into all memory
modules. Consider the case where the snoopy
cache 2a writes data. In response to the
address/command issued on the address/command
bus 5, the data path controller 12 controls the
multiplexer 11a and first connects the data path 6a
to the data path 8a.
Through this data path (the leftmost one of the
paths shown by dotted lines in the right half
portion of Fig. 6), data from the snoopy cache
2a are written into the memory module 9a. In
the next bus cycle, the data path controller 12
cancels the connection of the preceding cycle,
controls the multiplexer 11b, and connects the data
path 6a to the data path 8b. Through this data
path (the second one from the left end of the paths
shown by dotted lines in the right half portion
of Fig. 6), data from the snoopy cache 2a are
written into the memory module 9b. Similarly,
data from the snoopy cache 2a are written into
the memory modules 9c to 9h.
3) Data transfer from one cache to another:

Data transfer from one cache to another is
attained by writing data from a cache into an
associated memory module and by reading
them again. With this technique, reading of the
data can start in the cycle following the
bus cycle in which writing of the data starts. In
addition, data transfer can also be effected by
simultaneously establishing the write data path
and the read data path, as in the
second alternative example.
The use of this technique gives the
advantage that a memory access still in progress and
a subsequently commenced memory access never
conflict on the memory modules, and that the
average bus bandwidth increases. However, since this
technique requires memory with an access
time equal to the bus cycle, the memory system
becomes very expensive if existing semiconductor
memory is utilized.
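
The absence of module conflicts claimed above can be checked with a small sketch. It assumes the access pattern described for this alternative, in which a line transfer that starts in bus cycle s uses module 9a in cycle s, 9b in cycle s+1, and so on; the helper names are hypothetical.

```python
def module_schedule(start_cycle: int, num_modules: int = 8) -> dict[int, int]:
    """Map each bus cycle of one line transfer to the module index it uses."""
    return {start_cycle + i: i for i in range(num_modules)}


def find_conflicts(start_cycles: list[int]) -> list[tuple[int, int]]:
    """Return (cycle, module) pairs claimed by more than one transfer."""
    busy: set[tuple[int, int]] = set()
    clashes: list[tuple[int, int]] = []
    for start in start_cycles:
        for slot in module_schedule(start).items():
            if slot in busy:
                clashes.append(slot)
            busy.add(slot)
    return clashes
```

Because the single address/command bus admits at most one new access per bus cycle, the start cycles are always distinct, and under the stated assumption overlapping transfers then use different modules in every cycle.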

Claims

1. A multiprocessor data processing system
comprising a plurality of processors (1); a main
memory divided into a plurality of modules (9);
a plurality of cache memories (2) provided for
the processors; shared bus means (5) coupled
to the cache memories (2) for transferring data
address information to the cache memories;
control means provided for the cache memories
for monitoring the address information
transferred through the shared bus means to
perform data consistency procedures; and
interconnection network means (7) for selectively
interconnecting the cache memories and memory
modules, on the basis of the address
information, for data transfer therebetween.

2. A system as claimed in claim 1 wherein said
main memory is divided into said modules in
accordance with addresses.

3. A system as claimed in claim 1 wherein said
memory is divided into said modules in such a
manner that segments of a transfer data unit
for said cache memories are distributed to said
modules.

4. A system as claimed in any preceding claim
wherein data transfer among said cache
memories is performed by data transfer from a
source cache memory to said main memory
and data transfer from said main memory to a
destination cache memory.


5. A multiprocessor system comprising a plurality
of processors; a main memory divided into a
plurality of modules; a plurality of cache
memories provided for said processors;
interconnection network means for selectively
connecting said cache memories to said modules
and to said cache memories on the basis of
address information of transferred data; shared
bus means coupled to said cache memories
for transferring said address information to said
cache memories; and control means provided
for said cache memories for monitoring said
address information transferred through said
shared bus means to perform a desired
procedure for consistency of stored data.

6. A data transmission apparatus used in a
multiprocessor system having a plurality of
processors, cache memories provided for said
processors, and a main memory which is
divided into a plurality of modules, comprising
interconnection network means for selectively
connecting said cache memories to said
modules of said main memory on the basis of
address information of transferred data; and
shared bus means coupled to said cache
memories for transferring said address
information to said cache memories.

7. A multiprocessor system comprising a plurality 
of processors (1); a main memory divided into 
a plurality of modules (4); a plurality of cache 
memories (2) provided for the processors (1); 
interconnection network means (7) for selec- 
tively connecting said cache memories (2) to 
said modules on the basis of address informa- 
tion of transferred data; shared bus means (5) 
coupled to said cache memories for transfer- 
ring said address information to said cache 
memories; and control means provided for said 
cache memories for monitoring said address 
information transferred through said shared 
bus means to perform a desired procedure for 
consistency of stored data. 
[Fig. 4] MAXIMUM BUS BANDWIDTH = BUS CLOCK x CACHE LINE SIZE
= 25 MHz x 64 BYTES = 1.6G BYTES/SECOND
[Fig. 5] MAXIMUM BUS BANDWIDTH = BUS CLOCK x CACHE LINE SIZE
= 25 MHz x 8 BYTES = 200M BYTES/SECOND